Open Bug 1728784 Opened 4 years ago Updated 21 days ago

Add a stack metric type to Glean

Categories

(Data Platform and Tools :: Glean: SDK, enhancement, P3)

Tracking

(Not tracked)

People

(Reporter: mak, Unassigned)

Proposal for changing an existing or adding a new Glean metric type

I'd like to propose a simple way to collect part of a stack with Glean, so that performance issues or unexpected errors can be reported to telemetry with a reference to what caused them.

Who is the individual/team requesting this change?

Mak, for Storage and Search so far, but it may be useful to other teams too.

Is this about changing an existing metric type or creating a new one?

Creating a new metric type.

Can you describe the data that needs to be recorded?

Part of the stack at the point where the Glean recording function is invoked. The probe definition could specify the depth of the stack to collect.

Can you provide a raw sample of the data that needs to be recorded (this is in the abstract, and not any particular implementation details about its representation in the payload or the database)

Example 1: module A passes a promise to module B. Module B races that promise against a timeout because it has to do some work before and after it (a database transaction, for example). If the promise is not resolved in time, module B can capture a stack on timeout and record that module A passed a promise that took too long.
Example 2: a call to a method fails with a very unexpected error, and we'd like to know who the caller is and how often that error happens.
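
A minimal sketch of Example 1, under stated assumptions: raceWithTimeout and its onTimeout callback are hypothetical names, not part of Glean or any existing module, and setTimeout/clearTimeout are assumed to be available in the calling scope. The stack is captured synchronously on entry, because by the time the timer fires the caller will generally no longer be on the synchronous stack.

```javascript
// Hypothetical helper for illustration only; not an existing Glean or Firefox API.
function raceWithTimeout(promise, timeoutMs, onTimeout) {
  // Capture the stack now, while the caller (module A) is still on it.
  const callerStack = new Error().stack;

  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      // Hand the stack captured at call time to whatever records it.
      onTimeout(callerStack);
      reject(new Error("timed out"));
    }, timeoutMs);
  });

  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Module B would wrap the promise it received from module A with this helper and pass a callback that records the stack using whatever recording mechanism is available.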

What is the business question/use-case that requires the data to be recorded?

We'd like to understand how often an unexpected error or timeout happens and who is causing it, so that code can be optimized.

How would the data be consumed?

I would like to see aggregated counts per stack.

Why are existing metric types not enough?

BHR (Background Hang Reporter) uses its own very specialized way to collect stacks. The only other alternative we have right now is keyed scalars, but then you must reduce the stack to some identifying string yourself. That doesn't encourage people to use it; a simpler stack collector could help us improve the quality of the product.

What is the timeline by which the data needs to be collected?

There's no timeline on this. I just think it would be useful long term; in the meantime we'll keep using keyed scalars...

See Also: → 1704854
See Also: → 1816744

Does the object metric type serve these purposes?

Flags: needinfo?(mak)

Ah, some time has passed since this report, so I may not remember all the details precisely.

I think it's not exactly an object. The kind of data we'd like to report is like new Error().stack: it's string-like, with one entry per line. A string can be used, but one has to build a unique string out of the stack.
The stack has unknown length, which is why I was suggesting one could define a maximum depth to collect. Another issue is line/column numbers, which may or may not matter when querying: in some cases you may want precise references, in other cases you just want the function name to tell which functions are more likely to cause a problem. Similarly, you may just want to group stacks containing a certain caller name. Some of this can likely be done at query time.
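
To make the depth and line/column trade-offs concrete, here is a rough sketch of what such normalization could look like. It assumes SpiderMonkey-style frames of the form "functionName@source:line:column"; normalizeStack, maxDepth, and detail are hypothetical names, and the sample frames are invented.

```javascript
// Hypothetical helper, not an existing Glean or Firefox API.
// Assumes SpiderMonkey-style frames: "functionName@source:line:column".
function normalizeStack(stack, { maxDepth = 3, detail = "source" } = {}) {
  const frames = stack
    .split("\n")
    .filter(line => line.trim() !== "")
    .slice(0, maxDepth); // cap the otherwise unbounded stack length

  return frames
    .map(frame => {
      if (detail === "full") {
        return frame; // keep precise line and column references
      }
      if (detail === "name") {
        // Keep only the function name (the part before "@").
        const at = frame.indexOf("@");
        return at === -1 ? frame : (frame.slice(0, at) || "(anonymous)");
      }
      return frame.replace(/:\d+:\d+$/, ""); // drop ":line:column"
    })
    .join("\n");
}

// Invented sample stack, for illustration only.
const sampleStack = [
  "executeTransaction@resource://gre/modules/Sqlite.sys.mjs:10:5",
  "saveAll@resource://example/Store.sys.mjs:42:11",
  "@resource://example/Main.sys.mjs:7:3",
].join("\n");

console.log(normalizeStack(sampleStack, { maxDepth: 2 }));
console.log(normalizeStack(sampleStack, { detail: "name" }));
```

Dropping line and column numbers at collection time makes stacks group together across builds, at the cost of precision; keeping them and stripping at query time is the other option mentioned above.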

I actually tried to do something like that in https://searchfox.org/mozilla-central/rev/80ae03d93e3fd5769b16f37719b610e359f8fc62/toolkit/modules/Sqlite.sys.mjs#854-855 which basically keeps only the top caller from the stack, and it is working for our needs.
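
As an illustration of the "top caller only" idea (a sketch, not the code at the link above): the helpers below reduce a stack to a single identifying key and count occurrences per key. topCallerKey and recordStack are hypothetical names, a plain Map stands in for a keyed telemetry counter, and frames are assumed to be in SpiderMonkey's "functionName@source:line:column" form.

```javascript
// Hypothetical helpers; a Map stands in for a keyed telemetry counter.
function topCallerKey(stack, ownFile) {
  const frames = stack.split("\n").filter(line => line.trim() !== "");
  // First frame that does not come from the recording module itself.
  const frame = frames.find(line => !line.includes(ownFile));
  if (!frame) {
    return "unknown";
  }
  const at = frame.indexOf("@");
  const name = at === -1 ? "" : frame.slice(0, at);
  const source = at === -1 ? frame : frame.slice(at + 1);
  // Keep only the file name, without ":line:column".
  const file = source.replace(/:\d+:\d+$/, "").split("/").pop();
  return `${file}:${name || "(anonymous)"}`;
}

function recordStack(counts, stack, ownFile) {
  const key = topCallerKey(stack, ownFile);
  counts.set(key, (counts.get(key) ?? 0) + 1);
  return key;
}
```

Calling recordStack(counts, new Error().stack, "Sqlite.sys.mjs") at the point of failure would then yield one count per external caller, which is the kind of aggregation asked for earlier in this bug.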

I just think it would be nice to have a common utility instead of everyone having to do their own stack parsing.

Flags: needinfo?(mak)

It might not be 100% analogous, but in case it might help: in bug 1923657 the observability team added stack traces to the "crash" ping. There might be some commonality of cause that you could make use of.

Having a more common utility might be more of a bug 1704854 thing where, once we agree on a common data format for stacks, we could build that common utility that would capture them into a metric.

Alrighty, looks like this bug remains open and awaiting attention. Might not be a metric type, but then again it might not not be a metric type.

No longer blocks: 1784069
Component: Glean Metric Types → Glean: SDK
Priority: -- → P3